05_7_A_TS_CrossValidation

References

NOTE: the following material is not fully original; it is an extension of reference 1, with comments and other sources added to improve the student’s understanding:

  1. Hyndman, R.J., & Athanasopoulos, G., 2021. Forecasting: principles and practice. 3rd edition.
  2. Fable package documentation
  • https://fable.tidyverts.org/index.html
  • https://fable.tidyverts.org/articles/fable.html

Packages

library(fpp3)

Time Series Cross-Validation

Introduction to the concept

Evaluating accuracy on a single train-test split yields results that are highly dependent on the particular split performed.

To achieve more statistically robust results, we may:

  1. Compute error metrics on many different train-test splits
  2. Average the results obtained for each split.

This simple idea is what we refer to as cross-validation.

Cross-validation in time series

Cross-validation applied to time series consists of:

  1. Generating a series of train-test splits. These splits must be done considering the aspects below:
    • The splits must respect the time-order structure of the data. Unlike in other areas of ML, where train-test splits are based on different random sampling techniques, here the time structure of the data must be respected.
    • Training sets need a minimum size \(\rightarrow\) the earliest observations cannot be used as test sets, or the training set would be too small.
    • The smallest training dataset should be at least as large as the maximum forecast horizon required. If possible, the smallest training dataset should be bigger than this.
    • The training set must only contain observations that occurred prior to the observations of the test set.
    • All the test sets must have the same length, and they must be placed in time after their training set.
  2. Computing the point forecast accuracy metrics for each of the train-test splits generated and averaging them.

Again, it is essential to respect the time-structure of the data when performing these splits.
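The procedure above can be sketched in a few lines of base R, using an expanding training window and a naive one-step-ahead forecast (last observed value) as the model. The series y and the initial training size are purely illustrative choices:

```r
# Minimal sketch of expanding-window time series cross-validation for h = 1.
y <- c(10, 12, 11, 13, 15, 14, 16, 18)
init <- 4                       # minimum training-set size
errors <- sapply(init:(length(y) - 1), function(t) {
  train <- y[1:t]               # training set: observations 1..t (time order respected)
  fc    <- tail(train, 1)       # naive one-step-ahead forecast
  y[t + 1] - fc                 # forecast error on the single test observation
})
mae <- mean(abs(errors))        # average the errors over all the splits
mae                             # 1.75
```

Each iteration grows the training set by one observation and evaluates the forecast on the next, unseen observation, exactly as in the diagrams that follow.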

Visualizing the train-test splits:

Problem: given a time series, we want to fit multiple models and evaluate their performance using cross-validation. Input: the values of the time series \(y_t\) at a series of points in time, depicted below in grey. For simplicity, I only depict the points in time, without depicting the values \(y_t\):

Instead of splitting this time series in a single train-test split, we are going to generate multiple train-test splits based on this original time series.

Example 1: one-step ahead forecasts

Let us start with a simple example, in which:

  1. Training sets have an initial size and grow one observation at a time.
  2. Test sets consist of a single observation, with a single forecast horizon h = 1.

The diagram below represents this situation:

  • Blue observations represent the training sets
  • Orange observations represent the test sets.

Performing cross-validation based on this series of train-test sets means:

  1. Fit the models to each of the training sets.
  2. For each model fitted to each training set, evaluate the forecast errors on their corresponding test set.
  3. Compute the average of the resulting forecast errors over all the train-test splits.

Result: average errors over all the training datasets of each model for the forecast horizon being considered (in the example h = 1)

Example 2: multi-step forecasts

For multi-step forecasts, the same strategy is followed. This time the test observation is at a distance h (the forecast horizon) from the end of the training set.

Suppose we are interested in models that produce good 4-step-ahead forecasts. The corresponding train-test splits would be:

Again, performing cross-validation based on this series of train-test sets means:

  1. Fit the models to each of the training sets.
  2. For each model fitted to each training set, evaluate the forecast errors on their corresponding test set.
  3. Compute the average of the resulting forecast errors over all the train-test splits.

Result: average errors over all the training datasets of each model for the forecast horizon being considered (in the example h = 4)

Example 3: cross-validation for multiple forecast horizons

In practice we will be interested in analyzing which models perform best for a series of forecast horizons, not just for a specific forecast horizon. For this purpose we need train-test splits like those in the figure below. The difference is that now the test sets include more than one forecast horizon.

  • With the splits in the example, we could analyze forecast horizons of up to h=4.

With these train-test splits, cross validation can be performed at 2 levels:

  1. Performance of the different models for each forecast horizon separately.
    • Result: average performance metrics over all training datasets for each combination of:
      • Model.
      • Forecast horizon.
  2. Performance of the different models on average for all forecast horizons.
    • Result: average performance metrics over all training datasets and forecast horizons for each model.

Option 1: performance of the different models for each forecast horizon separately

  1. Fit the models to each of the training sets.
  2. For each model fitted to each training set, evaluate the forecast errors on their corresponding test set, separately for each forecast horizon.
  3. For each combination of model and forecast horizon, compute the average of the resulting errors over all the train-test splits.

Result: average errors of each model over all the training datasets for every forecast horizon being considered (in the example for h = 1, 2, 3 and 4).

  • That is, we would obtain h results per model (average error for each forecast horizon).

Option 2: performance of the different models averaged over all the forecast horizons

  1. Fit the models to each of the training sets.
  2. For each model fitted to each training set, evaluate the forecast errors on their corresponding test set, separately for each forecast horizon.
  3. For each combination of model and forecast horizon, compute the average of the resulting errors over all the train-test splits.
  4. For each model compute the average over all forecast horizons.

Result: average errors of each model over all the training datasets and forecast horizons.

  • That is, we would obtain 1 result per model (error averaged over all forecast horizons)

Function stretch_tsibble()

Let us move to the practical aspects of how to obtain these multiple train-test splits in R.

The function stretch_tsibble() is used in the fable library to create many training sets to use in a cross-validation context (this library is loaded along with fpp3). Arguments:

stretch_tsibble(.x, .step = 1, .init = 1)

  • .x: tsibble to be split following the time series cross validation strategy.
  • .step: a positive integer specifying the growth rate of the training sets. That is, how many observations at a time are added to the training datasets. Default: .step = 1.
  • .init: a positive integer for an initial size of the training set. Default: .init = 1.

Remember the indications given above about the minimum sizes of the training datasets.
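As a toy illustration (assuming the tsibble and dplyr packages are loaded, which fpp3 does for you), stretching a series of 5 observations with .init = 3 and .step = 1 produces 3 training sets of sizes 3, 4 and 5; the values in the toy series below are arbitrary:

```r
library(tsibble)
library(dplyr)

toy <- tsibble(t = 1:5, y = c(2, 4, 3, 5, 6), index = t)
toy_tr <- toy %>% stretch_tsibble(.init = 3, .step = 1)

# One row per (training set, observation); .id identifies the training set
toy_tr %>% as_tibble() %>% count(.id)   # training-set sizes: 3, 4 and 5
```

Note that the output stacks all the training sets on top of each other, so each original observation appears once per training set it belongs to.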

The figure below summarizes the meaning of these arguments applied to one of the examples given before (you can zoom in on the figure in the .html file to enlarge it).

Detailed example of time series cross validation

We will use the data for Google stock closing prices in 2015:

google_stock <- gafa_stock %>%
  filter(Symbol == "GOOG", year(Date) >= 2015) %>%
  mutate(day = row_number()) %>%
  update_tsibble(index = day, regular = TRUE)
google_2015 <- google_stock %>% filter(year(Date) == 2015)

Step 1. Create the set of training datasets using stretch_tsibble():

google_2015_tr <- google_2015 %>%
  stretch_tsibble(.init = 20, .step = 1) %>%
  relocate(Date, Symbol, .id) # reorder columns

google_2015_tr
# A tsibble: 31,688 x 10 [1]
# Key:       Symbol, .id [233]
   Date       Symbol   .id  Open  High   Low Close Adj_Close  Volume   day
   <date>     <chr>  <int> <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <int>
 1 2015-01-02 GOOG       1  526.  528.  521.  522.      522. 1447600     1
 2 2015-01-05 GOOG       1  520.  521.  510.  511.      511. 2059800     2
 3 2015-01-06 GOOG       1  512.  513.  498.  499.      499. 2899900     3
 4 2015-01-07 GOOG       1  504.  504.  497.  498.      498. 2065100     4
 5 2015-01-08 GOOG       1  495.  501.  488.  500.      500. 3353600     5
 6 2015-01-09 GOOG       1  502.  502.  492.  493.      493. 2069400     6
 7 2015-01-12 GOOG       1  492.  493.  485.  490.      490. 2322400     7
 8 2015-01-13 GOOG       1  496.  500.  490.  493.      493. 2370500     8
 9 2015-01-14 GOOG       1  492.  500.  490.  498.      498. 2235700     9
10 2015-01-15 GOOG       1  503.  503.  495.  499.      499. 2715800    10
# ℹ 31,678 more rows
# Number of training datasets generated
google_2015_tr %>% pull(.id) %>% max()
[1] 233

As you can see, 233 training datasets have been generated.

  • The column .id is an identifier for each of the training sets.
  • Because .init was set to 20, the first training set has 20 observations.
  • Because .step was set to 1, the training sets grow one observation at a time.
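The number of training sets can be anticipated from .init and .step: with n observations, stretch_tsibble() generates floor((n − .init)/.step) + 1 training sets. A quick sanity check (count() is a dplyr verb, loaded with fpp3):

```r
n <- nrow(google_2015)           # 252 trading days in 2015
floor((n - 20) / 1) + 1          # 252 - 20 + 1 = 233 training sets

# Sizes of the first training sets: 20, 21, 22, ... observations
google_2015_tr %>% as_tibble() %>% count(.id) %>% head(3)
```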

Step 2: fit the models to be assessed to the training datasets generated before

The following code fits a drift and a mean model to each of the training datasets! (2 x 233 = 466 models!)

google_2015_fit_cv <- google_2015_tr %>%
                        model(
                              Drift = RW(Close ~ drift()),
                              Mean = MEAN(Close)
                              )

google_2015_fit_cv
# A mable: 233 x 4
# Key:     Symbol, .id [233]
   Symbol   .id         Drift    Mean
   <chr>  <int>       <model> <model>
 1 GOOG       1 <RW w/ drift>  <MEAN>
 2 GOOG       2 <RW w/ drift>  <MEAN>
 3 GOOG       3 <RW w/ drift>  <MEAN>
 4 GOOG       4 <RW w/ drift>  <MEAN>
 5 GOOG       5 <RW w/ drift>  <MEAN>
 6 GOOG       6 <RW w/ drift>  <MEAN>
 7 GOOG       7 <RW w/ drift>  <MEAN>
 8 GOOG       8 <RW w/ drift>  <MEAN>
 9 GOOG       9 <RW w/ drift>  <MEAN>
10 GOOG      10 <RW w/ drift>  <MEAN>
# ℹ 223 more rows

The resulting mable has, as expected:

  • 233 rows, because there are a total of 233 training datasets (think of each training dataset as a separate time series)
  • The combination of the columns Symbol and .id indicates to which training dataset the model has been fit.
  • The columns Drift and Mean (user choices to name the models) correspond to the models fitted to each of these training datasets.

Step 3: generate forecasts

Let us generate forecasts with a horizon of up to 8 trading days for each of those models:

# TSCV accuracy
google_2015_fc_cv <- google_2015_fit_cv %>% forecast(h = 8)

google_2015_fc_cv
# A fable: 3,728 x 6 [1]
# Key:     Symbol, .id, .model [466]
   Symbol   .id .model   day        Close .mean
   <chr>  <int> <chr>  <dbl>       <dist> <dbl>
 1 GOOG       1 Drift     21  N(532, 102)  532.
 2 GOOG       1 Drift     22  N(533, 213)  533.
 3 GOOG       1 Drift     23  N(533, 335)  533.
 4 GOOG       1 Drift     24  N(534, 467)  534.
 5 GOOG       1 Drift     25  N(534, 609)  534.
 6 GOOG       1 Drift     26  N(535, 762)  535.
 7 GOOG       1 Drift     27  N(535, 924)  535.
 8 GOOG       1 Drift     28 N(536, 1097)  536.
 9 GOOG       1 Mean      21  N(510, 220)  510.
10 GOOG       1 Mean      22  N(510, 220)  510.
# ℹ 3,718 more rows

The output of the forecast table (fable) is, as expected:

  • 8 forecasts for each combination of Symbol, .id and .model. Therefore a total of 8 × 2 × 233 = 3,728 point forecasts (column .mean) and their corresponding forecast distributions (column Close).

Step 4: generate cross-validated accuracy metrics:

4.1 Option 1: generate accuracy metrics for each combination of model and forecast horizon.

To achieve this, we first need to add an additional column to the foregoing forecast table to identify to which horizon each forecast corresponds.

  • That is, we need a row counter that resets every time the combination of .id and .model changes.
  • Pay attention to the line as_fable(response = "Close", distribution = Close). This line ensures that, after performing group_by() followed by mutate() and ungroup(), the result of the operation is still a fable (forecast table).
    • This is very important for the function accuracy() to work properly in the subsequent steps.
google_2015_fc_cv <- google_2015_fc_cv %>%
  group_by(.id, .model) %>%
  mutate(h = row_number()) %>%
  ungroup() %>%
  as_fable(response = "Close", distribution = Close) %>% 
  select(h, everything()) # Reorder columns

google_2015_fc_cv
# A fable: 3,728 x 7 [1]
# Key:     Symbol, .id, .model [466]
       h Symbol   .id .model   day        Close .mean
   <int> <chr>  <int> <chr>  <dbl>       <dist> <dbl>
 1     1 GOOG       1 Drift     21  N(532, 102)  532.
 2     2 GOOG       1 Drift     22  N(533, 213)  533.
 3     3 GOOG       1 Drift     23  N(533, 335)  533.
 4     4 GOOG       1 Drift     24  N(534, 467)  534.
 5     5 GOOG       1 Drift     25  N(534, 609)  534.
 6     6 GOOG       1 Drift     26  N(535, 762)  535.
 7     7 GOOG       1 Drift     27  N(535, 924)  535.
 8     8 GOOG       1 Drift     28 N(536, 1097)  536.
 9     1 GOOG       1 Mean      21  N(510, 220)  510.
10     2 GOOG       1 Mean      22  N(510, 220)  510.
# ℹ 3,718 more rows

As you can see, the column h now identifies to which forecast horizon each forecast corresponds. It resets every time the value of the column .id or .model changes, because in the preceding code we performed group_by(.id, .model).

Once this has been done, we can use the accuracy() function to compute the metrics for each combination of .model and h, giving it an additional argument by that specifies the aggregation level for the accuracy metrics:

summary_cv_h <- google_2015_fc_cv %>% 
  accuracy(google_2015, by = c(".model", "h", "Symbol"))
Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
8 observations are missing between 253 and 260

In the preceding code, pay attention to the line accuracy(google_2015, by = c(".model", "h", "Symbol")):

  • We gave google_2015 as an argument to accuracy. This is the original dataset, prior to applying the function stretch_tsibble to generate the training datasets.
  • by = c(".model", "h", "Symbol") specifies the desired aggregation level for the accuracy metrics.
    • NOTE: in this example, since we are dealing with a single time series, we could omit the key identifier of this time series (Symbol), because it only takes one value (GOOG). But if you have a dataset that contains multiple time series, you will need to include it.

Let us inspect the result:

summary_cv_h
# A tibble: 16 × 12
   .model     h Symbol .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
   <chr>  <int> <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
 1 Drift      1 GOOG   Test   0.430  11.3  7.17 0.0515  1.16  1.01  1.01 0.0859
 2 Drift      2 GOOG   Test   0.951  16.8 10.6  0.113   1.72  1.49  1.50 0.504 
 3 Drift      3 GOOG   Test   1.51   20.5 13.3  0.181   2.17  1.87  1.83 0.669 
 4 Drift      4 GOOG   Test   2.04   23.4 15.3  0.249   2.47  2.14  2.09 0.747 
 5 Drift      5 GOOG   Test   2.51   25.8 16.8  0.307   2.73  2.36  2.31 0.789 
 6 Drift      6 GOOG   Test   2.98   27.9 18.4  0.367   2.98  2.58  2.49 0.827 
 7 Drift      7 GOOG   Test   3.47   29.9 20.1  0.431   3.25  2.82  2.67 0.835 
 8 Drift      8 GOOG   Test   3.93   31.5 21.1  0.490   3.41  2.96  2.82 0.857 
 9 Mean       1 GOOG   Test  60.5    85.5 62.6  8.87    9.26  8.78  7.64 0.976 
10 Mean       2 GOOG   Test  61.1    86.1 63.2  8.95    9.35  8.87  7.70 0.976 
11 Mean       3 GOOG   Test  61.7    86.7 63.8  9.04    9.44  8.95  7.75 0.976 
12 Mean       4 GOOG   Test  62.3    87.4 64.4  9.13    9.53  9.04  7.81 0.976 
13 Mean       5 GOOG   Test  62.9    88.0 65.1  9.22    9.63  9.13  7.86 0.976 
14 Mean       6 GOOG   Test  63.5    88.6 65.6  9.31    9.71  9.21  7.92 0.976 
15 Mean       7 GOOG   Test  64.1    89.2 66.2  9.40    9.80  9.29  7.97 0.977 
16 Mean       8 GOOG   Test  64.7    89.8 66.8  9.48    9.88  9.37  8.02 0.977 

As you can see, the result details the accuracy metrics (columns ME to ACF1) for each combination of .model and h. That is, for each combination of model and forecast horizon.

The output above contains, as expected:

  • 16 rows, one row for each combination of model (.model) and forecast horizon (h)
  • One column for each of the accuracy metrics returned by accuracy().
    • In this course, we will limit ourselves to the ones explained in class.

Each of the rows above shows the average error metrics over all the training datasets (233 training sets) of each model fitted, separated by forecast horizon. Some examples:

  • The row corresponding to h = 1 and model = Drift contains the average error metrics (MAE, RMSE…) of the 233 Drift models (one per training set) when forecasting one step ahead on their corresponding test datasets.
  • The row corresponding to h = 4 and model = Drift contains the average error metrics (MAE, RMSE…) of the 233 Drift models (one per training set) when forecasting the value four steps ahead on their corresponding test datasets.

This is a much more statistically robust method than just performing a single train-test split.

A good way to choose the best-performing forecasting model: find the model with the smallest cross-validated RMSE for the forecast horizon of interest. As explained in the session on point forecast accuracy, minimizing the RMSE leads to forecasts of the mean.

summary_cv_h %>% 
  ggplot(aes(x = h, y = RMSE, color = .model)) +
  geom_point()

As you can see, the graph indicates that, for this particular time series, the average performance of the drift model over all the train-test splits is much better (smaller errors) than that of the mean method for all the forecast horizons analyzed.
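Beyond visual inspection, we can extract the best model per horizon directly from the summary table; slice_min() is a dplyr verb that keeps the row(s) with the smallest value of the given variable:

```r
# For each forecast horizon, keep the model with the smallest RMSE
summary_cv_h %>%
  group_by(h) %>%
  slice_min(RMSE, n = 1) %>%
  select(h, .model, RMSE)
```

For this series, the Drift model wins at every horizon, in agreement with the plot.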

4.2 Option 2: generate average accuracy metrics for each model averaged over all the forecasting horizons.

Perhaps we are not interested in producing forecasts with a specific horizon, but rather we would like a model that returns the best forecasts on average over all the horizons.

To attain this, we use the accuracy() function again, but we do not provide the by argument. It will therefore default to grouping solely by .model and the key identifier of the time series. Let us check this:

google_2015_fc_cv %>% 
  accuracy(google_2015)
Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
8 observations are missing between 253 and 260
# A tibble: 2 × 11
  .model Symbol .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Drift  GOOG   Test   2.21  24.2  15.3 0.272  2.48  2.15  2.16 0.814
2 Mean   GOOG   Test  62.6   87.7  64.7 9.17   9.57  9.08  7.83 0.966

This is equivalent to doing:

google_2015_fc_cv %>% 
  accuracy(google_2015, by = c(".model", "Symbol"))
Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
8 observations are missing between 253 and 260
# A tibble: 2 × 11
  .model Symbol .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Drift  GOOG   Test   2.21  24.2  15.3 0.272  2.48  2.15  2.16 0.814
2 Mean   GOOG   Test  62.6   87.7  64.7 9.17   9.57  9.08  7.83 0.966

The above metrics are the average performance metrics of each of the model types over all the train-test splits and forecast horizons. For example:

  • The row corresponding to the Drift model contains the error metrics of the drift model averaged over the 233 train-test splits and over all the forecast horizons.
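To rank the models by their overall cross-validated RMSE, we can simply sort the resulting table with dplyr's arrange():

```r
# Models ordered from best to worst overall RMSE
google_2015_fc_cv %>%
  accuracy(google_2015) %>%
  arrange(RMSE)   # Drift (RMSE 24.2) ranks ahead of Mean (RMSE 87.7)
```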

Exercise cross-validation

Task

Perform cross-validation on the Australian beer production dataset. Use the dataset recent_production created during this lesson.

Step 0. Identify your dataset and think about which variables you need

  • We will work with recent_production, specifically with the production of beer, which we will store in beer_production. We will apply stretch_tsibble() to this time series to generate our set of training datasets.
  • For convenience I will select only the variables relevant to the analysis.
recent_production <- aus_production %>%
  filter(year(Quarter) >= 1992)
beer_production <- select(recent_production, Quarter, Beer)
beer_production
# A tsibble: 74 x 2 [1Q]
   Quarter  Beer
     <qtr> <dbl>
 1 1992 Q1   443
 2 1992 Q2   410
 3 1992 Q3   420
 4 1992 Q4   532
 5 1993 Q1   433
 6 1993 Q2   421
 7 1993 Q3   410
 8 1993 Q4   512
 9 1994 Q1   449
10 1994 Q2   381
# ℹ 64 more rows

Step 1. Create the set of training datasets using stretch_tsibble()

Step 2: Fit the models

Step 3: generate the forecasts

Consider a forecast horizon of up to 4.

Step 4: generate cross-validated accuracy metrics

4.1 Option 1: generate accuracy metrics for each combination of model and forecast horizon.

4.2 Option 2: generate average accuracy metrics for each model averaged over all the forecasting horizons.